Data Science Project: Ethnic Bias of Standardized Testing

By: Mandy Yu, Fall 2021

Github: https://github.com/yumandee

LinkedIn: https://www.linkedin.com/in/mandy-yu-378a49162/

The institutionalized standardized testing of New York State has inaccurately reflected students’ mental capacities. This project explores whether there is a correlation between New York statewide English Language Arts (ELA) and Math exam results and student ethnicity across all school districts. The ethnic composition of test results is examined to expose the inaccuracy and unfairness of standardized testing.

I examined the top-scoring schools in NYC and took a closer look at their ethnic compositions. A correlation ranker revealed that as the population of certain ethnicities (typically non-minority groups such as White or Asian students) increased, so too did the mean scale score. The opposite also applies: as the population of other ethnicities (typically minority groups such as Black or Hispanic students) increased, the mean scale score decreased.

Datasets

This project uses data from OpenData NYC on New York State Standardized Testing results from 2013-2018.

The ELA and Math statewide tests evaluate students from 3rd through 8th grade on their ability to solve Common Core questions. Students are evaluated on a scale from 1-4, 1 being the lowest and 4 being the highest. These datasets provide information on schools in New York City and their reported test scores, including the percentage and number of students who received each exam score. Information is provided for every grade at the listed school.

This dataset provides a demographic snapshot of the same schools listed in the statewide exam result datasets. It includes the total number of students in each grade and the gender, ethnic, and economic composition of the student body. Only the ethnic composition data is used in this project: the female-to-male ratio stayed close to an even split, and the timespan of the project did not allow for an in-depth exploration of economic composition in relation to exam scores.

This dataset was used to create a map of the school locations.

Techniques

To explore the datasets and test whether there is an ethnic bias in standardized testing, I used pandas to clean and gather the data. Drawing on previous experience with JSON files, I was able to get data from the OpenData NYC website without downloading CSV files. I used a correlation ranker and scatter plots to explore whether there is a correlation between ethnicity and exam scores.
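The JSON-based loading described above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the `/resource/<id>.json` endpoint pattern and the `$limit` parameter are assumptions about how OpenData NYC exposes its datasets, the helper names `build_url` and `load_records` are hypothetical, and the sample rows are made up so the example runs offline.

```python
import io

import pandas as pd

# Assumed endpoint pattern for OpenData NYC; "gu76-8i7h" is the dataset id
# that appears in the ELA dataset URL cited in this write-up.
BASE_URL = "https://data.cityofnewyork.us/resource/{dataset_id}.json"


def build_url(dataset_id: str, limit: int = 50000) -> str:
    """Return the JSON endpoint for a dataset, requesting up to `limit` rows."""
    return BASE_URL.format(dataset_id=dataset_id) + f"?$limit={limit}"


def load_records(source) -> pd.DataFrame:
    """Read a JSON array of records (URL, path, or file-like) into a DataFrame.

    dtype=False keeps values as delivered, so type conversion stays an
    explicit, separate cleaning step.
    """
    return pd.read_json(source, dtype=False)


# Made-up stand-in for the API response (column names are hypothetical).
sample = io.StringIO(
    '[{"dbn": "01M015", "grade": "3", "year": "2013", "mean_scale_score": "289"},'
    ' {"dbn": "01M015", "grade": "4", "year": "2013", "mean_scale_score": "s"}]'
)
ela = load_records(sample)

print(build_url("gu76-8i7h", limit=1000))
print(ela.shape)
```

With network access, `load_records(build_url("gu76-8i7h"))` would fetch the records directly, assuming the endpoint behaves as described.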

Resources

Datasets:

Code

Gathering and Cleaning Data

This project explores data on New York Statewide English Language Arts and Math exams from the 2012-13 to 2017-18 school years. In the 2018-19 school year, the statewide exam was changed from three days of testing to two. Thus, data from the 2018-19 school year was omitted. The project also utilizes geographical data to create a mapped visualization of the correlation between exam results and ethnicity.
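Dropping the 2018-19 results can be sketched as a simple year filter. The column name `year` and the toy rows are assumptions for illustration; the real dataset may label school years differently.

```python
import pandas as pd

# Toy rows; "year" is an assumed column holding the spring year of each exam.
results = pd.DataFrame({
    "dbn": ["01M015"] * 3,
    "year": ["2017", "2018", "2019"],
    "mean_scale_score": [300.0, 305.0, 600.0],
})

# Keep the 2012-13 through 2017-18 school years (reported as 2013..2018).
results["year"] = results["year"].astype(int)
kept = results[results["year"].between(2013, 2018)].reset_index(drop=True)

print(kept["year"].tolist())  # [2017, 2018]
```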

The NY Statewide ELA and Math Exams are graded on a scale of 1-4 where:

ELA Test Results from 2013-2018 by District

Data obtained from OpenData NYC: https://data.cityofnewyork.us/Education/2013-2019-English-Language-Arts-ELA-Test-Results-S/gu76-8i7h

Math Test Results from 2013-2018 by District

Data obtained from OpenData NYC: https://data.cityofnewyork.us/Education/2013-2019-Math-Test-Results-School-SWD-Ethnicity-G/74ah-8ukf

To work effectively with these numeric values, we need to convert the columns to the appropriate datatypes. This applies to both the Math and ELA dataframes.
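One way to do this conversion is with `pd.to_numeric`. The sketch below uses hypothetical column names, and it assumes some cells hold a non-numeric placeholder (shown here as "s") that should become missing values.

```python
import pandas as pd

# Hypothetical column names and values; "s" stands in for a non-numeric entry.
math_df = pd.DataFrame({
    "number_tested": ["25", "6"],
    "mean_scale_score": ["302", "s"],
    "level_3_4_pct": ["48.0", "s"],
})

numeric_cols = ["number_tested", "mean_scale_score", "level_3_4_pct"]

# errors="coerce" turns anything unparseable into NaN instead of raising.
math_df[numeric_cols] = math_df[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(math_df.dtypes)
```

Coercing to NaN keeps the affected rows in the dataframe while excluding them from later averages, since pandas skips NaN in `mean()` by default.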

Ethnic Demographics of NYC Public Schools from 2013-2018

Data obtained from OpenData NYC: https://data.cityofnewyork.us/Education/2013-2018-Demographic-Snapshot-School/s52a-8aq6

I used apply() to extract the district of every school in the ELA, Math, and Demographic dataframes. The district is used later to group and visualize results by district.
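A sketch of the district extraction is shown below. It assumes each school is identified by a DBN code whose first two characters are the district number (for example, "01M015" for a school in District 1); the column name `dbn` and the helper `district_from_dbn` are hypothetical.

```python
import pandas as pd


def district_from_dbn(dbn: str) -> int:
    """Return the district number, assumed to be the first two DBN characters."""
    return int(dbn[:2])


# Illustrative DBN codes only.
schools = pd.DataFrame({"dbn": ["01M015", "26Q046", "20K229"]})
schools["district"] = schools["dbn"].apply(district_from_dbn)

print(schools["district"].tolist())  # [1, 26, 20]
```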

NYC Public School Location Data from 2017-18

Data obtained from OpenData NYC: https://data.cityofnewyork.us/Education/2017-2018-School-Locations/p6h4-mpyy

After the data is cleaned to keep only the necessary columns, and those columns are renamed to be consistent with the other dataframes, the coordinates still need to be extracted from the location column.

If we take a closer look at one of the entries, it appears to be a Python dictionary. This makes it easy to index into each entry and extract the values needed.
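The extraction could look roughly like this. The column name `location_1`, the dictionary keys `latitude` and `longitude`, and the coordinate values are all assumptions for illustration; the actual keys should be checked against one real entry first.

```python
import pandas as pd

# Made-up entries; the key names are an assumption about the location column.
locations = pd.DataFrame({
    "dbn": ["01M015", "26Q046"],
    "location_1": [
        {"latitude": "40.722", "longitude": "-73.979"},
        {"latitude": "40.750", "longitude": "-73.760"},
    ],
})


def get_coord(entry, key):
    """Return the value stored under `key` as a float, or None if it is absent."""
    if isinstance(entry, dict) and key in entry:
        return float(entry[key])
    return None


locations["latitude"] = locations["location_1"].apply(lambda e: get_coord(e, "latitude"))
locations["longitude"] = locations["location_1"].apply(lambda e: get_coord(e, "longitude"))

print(locations[["dbn", "latitude", "longitude"]])
```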

Analysis

The average mean scale score and average count/percentages of students were calculated for each grade in each district.
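A grouped mean is one way to compute these averages. The sketch below uses made-up rows and hypothetical column names.

```python
import pandas as pd

# Made-up school-level rows with hypothetical column names.
ela = pd.DataFrame({
    "district": [1, 1, 1, 2],
    "grade": ["3", "3", "4", "3"],
    "mean_scale_score": [300.0, 310.0, 295.0, 320.0],
    "level_3_4_pct": [40.0, 50.0, 35.0, 70.0],
})

# Average every numeric measure within each (district, grade) pair.
ela_avg = (
    ela.groupby(["district", "grade"], as_index=False)[
        ["mean_scale_score", "level_3_4_pct"]
    ].mean()
)

print(ela_avg)
```

Note that this is an unweighted mean over rows, so a small school counts as much as a large one. Weighting by the number of students tested would be an alternative design.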

ELA Average Test Scores by District and Grade

Math Average Test Scores by District and Grade

The percentage of students who received a satisfactory score (3 or 4) on the statewide exam is reflected in the % Level 3+4 column. Let's take a look at the range of values for every grade across all districts.
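The range per grade can be sketched with a grouped min/max. In the toy data below, the two grade 3 extremes (20.78 and 67.36) are the ELA values quoted in this write-up; the district labels, the middle grade 3 value, the grade 4 values, and the column names are made up.

```python
import pandas as pd

# District-level averages; only 20.78 and 67.36 come from the write-up.
avg = pd.DataFrame({
    "district": ["A", "B", "C", "A", "B", "C"],
    "grade": ["3", "3", "3", "4", "4", "4"],
    "level_3_4_pct": [20.78, 45.00, 67.36, 25.00, 40.00, 60.00],
})

# Lowest and highest district average for each grade, plus the gap between them.
ranges = avg.groupby("grade")["level_3_4_pct"].agg(["min", "max"])
ranges["spread"] = (ranges["max"] - ranges["min"]).round(2)

print(ranges)
```

The spread is expressed in percentage points, since it is a difference between two percentages.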

In both the ELA and Math averages, there is a large gap between districts. In the ELA data, one district had 20.78% of its third grade students from 2013-18 receive a satisfactory score of 3 or 4, while another had 67.36%. This is a difference of about 47 percentage points. In the Math data, one district had 15.15% of its 5th grade students from 2013-18 receive a 3 or 4 and another had 73.85%. That is a difference of about 59 percentage points.

With this information, it is evident that there is a difference between districts. In this project, I explore how ethnicity possibly influences the averages. So, let's take a look at each district's ethnic composition.

First, let's convert the datatypes of the appropriate columns to floats.

Now, let's average the demographics from 2013-18 for each district.
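The float conversion and the per-district averaging can be sketched together. Column names and values below are hypothetical, and the percentages are assumed to arrive as plain numeric strings.

```python
import pandas as pd

# Made-up yearly snapshots with hypothetical column names.
demo = pd.DataFrame({
    "district": [1, 1, 2, 2],
    "year": ["2013-14", "2014-15", "2013-14", "2014-15"],
    "asian_pct": ["10.0", "12.0", "40.0", "44.0"],
    "black_pct": ["30.0", "28.0", "5.0", "7.0"],
})

eth_cols = ["asian_pct", "black_pct"]

# Strings to floats first, then average over all years within each district.
demo[eth_cols] = demo[eth_cols].astype(float)
district_demo = demo.groupby("district")[eth_cols].mean()

print(district_demo)
```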

Bottom 5 Districts and Ethnicity

Let's observe the bottom 5 districts based on their mean scale score for 5th graders and explore whether there is a potential correlation between ethnicity and score. The dataframe below shows the 5 lowest scoring districts. In four of them, more than 65% of the student population was Hispanic. In District 5, 50.71% of the student population was Black.
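Selecting the lowest scoring districts and attaching their demographics could be done as in the sketch below. All values and column names are made up, and the example keeps 2 districts rather than 5 to stay small. Swapping `nsmallest` for `nlargest` gives the top scoring districts instead.

```python
import pandas as pd

# Made-up 5th grade district averages and demographics.
scores = pd.DataFrame({
    "district": [1, 2, 3, 4],
    "mean_scale_score": [290.0, 320.0, 300.0, 330.0],
})
demo = pd.DataFrame({
    "district": [1, 2, 3, 4],
    "asian_pct": [5.0, 50.0, 3.0, 30.0],
    "black_pct": [20.0, 5.0, 55.0, 10.0],
    "hispanic_pct": [70.0, 15.0, 40.0, 20.0],
    "white_pct": [5.0, 30.0, 2.0, 40.0],
})
eth_cols = ["asian_pct", "black_pct", "hispanic_pct", "white_pct"]

# Lowest scoring districts, joined with their ethnic composition.
bottom = scores.nsmallest(2, "mean_scale_score").merge(demo, on="district")

# Name of the ethnicity column with the largest share in each district.
bottom["largest_group"] = bottom[eth_cols].idxmax(axis=1)

print(bottom[["district", "mean_scale_score", "largest_group"]])
```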

Top 5 Districts and Ethnicity

Let's observe the top 5 districts based on their mean scale score for 5th graders and again explore whether there is a potential correlation between ethnicity and score. The dataframe below shows the 5 highest scoring districts. In Districts 26, 20, and 25, Asian students are the largest group (nearly 50%). District 2 shows a somewhat even distribution between all ethnicities. District 3 shows an even distribution between Black, Hispanic, and White students.

Across the bottom and top 5 scoring districts, mean scale score appears to be correlated with ethnic composition. However, the correlation is not strong, as minority groups also score high on the statewide tests.

Visualizations

Let's check whether a correlation exists between ethnicity and score using a Pearson correlation ranker.
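A Pearson correlation ranker can be sketched in a few lines of pandas. The helper name `rank_correlations`, the column names, and the values below are all hypothetical, so the printed coefficients will not match the ones reported in this project.

```python
import pandas as pd

# Made-up district-level data.
districts = pd.DataFrame({
    "asian_pct": [5.0, 10.0, 30.0, 50.0],
    "black_pct": [50.0, 40.0, 15.0, 5.0],
    "level_3_4_pct": [25.0, 30.0, 55.0, 70.0],
})


def rank_correlations(frame: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every column with `target`, strongest positive first."""
    corr = frame.corr(method="pearson")[target].drop(target)
    return corr.sort_values(ascending=False)


ranking = rank_correlations(districts, "level_3_4_pct")
print(ranking)
```

Sorting by the signed coefficient puts positively correlated groups at the top and negatively correlated groups at the bottom. Sorting by absolute value would instead rank by strength alone.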

Let's take a close look at the correlation between ethnicity and percentage of students in the district who received a satisfactory grade of 3 or 4.

% Asian vs % Level 3+4

There is a strong positive correlation of 0.751451 between the percentage of Asian students and the percentage of students who received a 3 or 4. As the percentage of Asian students increases, the percentage of students who receive a 3 or 4 tends to increase as well. This suggests that the share of Asian students is possibly related to the share of students who receive a satisfactory score, although a correlation alone does not show that one causes the other.

% Black vs % Level 3+4

There is a moderate negative correlation of -0.448200 between the percentage of Black students and the percentage of students who received a 3 or 4. As the percentage of Black students increases, the percentage of students who receive a 3 or 4 tends to decrease. This suggests that a higher percentage of Black students is associated with fewer students who receive a 3 or 4, which points to a possible relationship between ethnic composition and satisfactory scores.

Scatter Plots

% Asian vs % Level 3+4

% Black vs % Level 3+4

% White vs % Level 3+4

% Hispanic vs % Level 3+4

These scatter plots suggest that schools with higher populations of minority students tend to have lower average statewide test scores. While the scores may not be directly influenced by ethnicity, the plots suggest that such an influence may exist.
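Scatter plots like the ones above could be produced with matplotlib, as in this minimal sketch. The data and column names are made up, only two of the four ethnicity panels are drawn to keep the example short, and the "Agg" backend is chosen only so the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data with hypothetical column names.
data = pd.DataFrame({
    "asian_pct": [5.0, 10.0, 30.0, 50.0],
    "black_pct": [50.0, 40.0, 15.0, 5.0],
    "level_3_4_pct": [25.0, 30.0, 55.0, 70.0],
})
panels = {"asian_pct": "% Asian", "black_pct": "% Black"}

# One panel per ethnicity, sharing the y-axis so the panels are comparable.
fig, axes = plt.subplots(1, len(panels), figsize=(10, 4), sharey=True)
for ax, (col, label) in zip(axes, panels.items()):
    ax.scatter(data[col], data["level_3_4_pct"])
    ax.set_xlabel(label)
    ax.set_title(f"{label} vs % Level 3+4")
axes[0].set_ylabel("% Level 3+4")
fig.tight_layout()
```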

School District Zones